Coyoacan
A symbolic Perl algorithm for the unification of Nahuatl word spellings
Guzmán-Landa, Juan-José, Vázquez-Osorio, Jesús, Torres-Moreno, Juan-Manuel, Torres, Ligia Quintana, Figueroa-Saavedra, Miguel, Avendaño-Garrido, Martha-Lorena, Ranger, Graham, Velázquez-Morales, Patricia, Martínez, Gerardo Eugenio Sierra
In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called $π$-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules in symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing them in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.
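As a rough illustration of what "linguistic rules in symbolic regular expressions" can look like, the sketch below applies an ordered list of grapheme rewrites that map a classical-style spelling to a modern-style one. The rules, their order, and the name `unify` are assumptions chosen for illustration only; they are not the rules of the paper, whose implementation is in Perl (Python is used here).

```python
import re

# Hypothetical, ordered rewrite rules (pattern, replacement).
# They are NOT taken from the paper; they only illustrate the technique.
RULES = [
    (re.compile(r"qu(?=[ei])"), "k"),   # que, qui -> ke, ki
    (re.compile(r"c(?=[ao])"), "k"),    # ca, co   -> ka, ko
    (re.compile(r"hu(?=[aei])"), "w"),  # hua, hue, hui -> wa, we, wi
    (re.compile(r"tz"), "ts"),          # tz -> ts (before the bare z rule)
    (re.compile(r"c(?=[ei])"), "s"),    # ce, ci -> se, si
    (re.compile(r"z"), "s"),            # remaining z -> s
]


def unify(text: str) -> str:
    """Lowercase the text and apply every rewrite rule in order."""
    out = text.lower()
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out


if __name__ == "__main__":
    for word in ("nahuatl", "quetzalli", "cihuatl", "calli"):
        print(word, "->", unify(word))
```

Rule order matters in such a scheme: the digraph `tz` is rewritten before the single letter `z`, so that the two rules do not interfere with each other.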
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > Mexico > Veracruz > Xalapa (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Multi-agent Auditory Scene Analysis
Rascon, Caleb, Gato-Diaz, Luis, García-Alarcón, Eduardo
Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which, information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). A considerable amount of effort and computational complexity has been invested in the state of the art to develop techniques that are as error-free as possible. However, doing so gives rise to an ASA system that is non-viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, human-robot interaction, etc. To this end, in this work, a multi-agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors, such as: using the quality of the separation output to correct the location error; and using the classification result to reduce the localization's sensitivity towards interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a publicly available framework that uses open-source tools for sound acquisition and reproduction (JACK) and inter-agent communication (ROS2), allowing users to add their own agents.
- North America > Mexico > Mexico City > Coyoacan (0.04)
- North America > United States > Massachusetts (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
Long, Yunbo, Xu, Liming, Brintrup, Alexandra
To evaluate the fidelity of synthetic tabular data, numerous metrics have been proposed to assess accuracy and diversity, including both low-order statistics (e.g., Density Estimation and Correlation Score (Zhang et al., 2023), Average Coverage Scores (Zein & Urvoy, 2022)) and high-order statistics (e.g., α-Precision and β-Recall (Alaa et al., 2022)). However, these metrics operate at a high level and fail to evaluate whether synthetic data preserves logical relationships, such as hierarchical or semantic dependencies between features. This highlights the need for a more fine-grained, context-aware evaluation of multivariate dependencies. To address this, we propose three evaluation metrics: Hierarchical Consistency Score (HCS), Multivariate Dependency Index (MDI), and Distributional Similarity Index (DSI). To assess the effectiveness of these metrics in quantifying inter-column relationships, we select five representative tabular data generation methods from different categories for evaluation. Their performance is measured using both existing and our proposed metrics on a real-world dataset rich in logical consistency and dependency constraints. Experimental results validate the effectiveness of our proposed metrics and reveal the limitations of existing approaches in preserving logical relationships in synthetic tabular data. Additionally, we discuss potential pathways to better capture logical constraints within joint distributions, paving the way for future advancements in synthetic tabular data generation.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > Southeast Asia (0.06)
Noncommutative Model Selection and the Data-Driven Estimation of Real Cohomology Groups
Guzmán-Tristán, Araceli, Rieser, Antonio, Velázquez-Richards, Eduardo
We propose three completely data-driven methods for estimating the real cohomology groups $H^k (X ; \mathbb{R})$ of a compact metric-measure space $(X, d_X, \mu_X)$ embedded in a metric-measure space $(Y,d_Y,\mu_Y)$, given a finite set of points $S$ sampled from a uniform distribution $\mu_X$ on $X$, possibly corrupted with noise from $Y$. We present the results of several computational experiments in the case that $X$ is embedded in $\mathbb{R}^n$, where two of the three algorithms performed well.
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > Mexico > Guanajuato (0.04)
- North America > United States > South Carolina > Richland County > Columbia (0.04)
Sustainable Visions: Unsupervised Machine Learning Insights on Global Development Goals
García-Rodríguez, Alberto, Núñez, Matias, Pérez, Miguel Robles, Govezensky, Tzipe, Barrio, Rafael A., Gershenson, Carlos, Kaski, Kimmo K., Tagüeña, Julia
The United Nations 2030 Agenda for Sustainable Development outlines 17 goals to address global challenges. However, progress has been slower than expected and, consequently, there is a need to investigate the reasons behind this fact. In this study, we used a novel data-driven methodology to analyze data from 107 countries (2000-2022) using unsupervised machine learning techniques. Our analysis reveals strong positive and negative correlations between certain SDGs. The findings show that progress toward the SDGs is heavily influenced by geographical, cultural and socioeconomic factors, with no country on track to achieve all goals by 2030. This highlights the need for a region-specific, systemic approach to sustainable development that acknowledges the complex interdependencies of the goals and the diverse capacities of nations. Our approach provides a robust framework for developing efficient and data-informed strategies, to promote cooperative and targeted initiatives for sustainable progress.
- South America > Uruguay (0.04)
- North America > Mexico > Mexico City > Coyoacan (0.04)
- North America > Haiti (0.04)
Design and analysis of tweet-based election models for the 2021 Mexican legislative election
Vigna-Gómez, Alejandro, Murillo, Javier, Ramirez, Manelik, Borbolla, Alberto, Márquez, Ian, Ray, Prasun K.
Modelling and forecasting real-life human behaviour using online social media is an active endeavour of interest in politics, government, academia, and industry. Since its creation in 2006, Twitter has been proposed as a potential laboratory that could be used to gauge and predict social behaviour. During the last decade, the user base of Twitter has been growing and becoming more representative of the general population. Here we analyse this user base in the context of the 2021 Mexican Legislative Election. To do so, we use a dataset of 15 million election-related tweets in the six months preceding election day. We explore different election models that assign political preference to either the ruling parties or the opposition. We find that models using data with geographical attributes determine the results of the election with better precision and accuracy than conventional polling methods. These results demonstrate that analysis of public online data can outperform conventional polling methods, and that political analysis and general forecasting would likely benefit from incorporating such data in the immediate future. Moreover, the same Twitter dataset with geographical attributes is positively correlated with results from official census data on population and internet usage in Mexico. These findings suggest that we have reached a period in time when online activity, appropriately curated, can provide an accurate representation of offline behaviour.
- North America > Mexico > Estado de México (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > Mexico > Mexico City > Mexico City (0.06)
- Information Technology > Services (1.00)
- Government > Voting & Elections (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Privacy Loss of Noisy Stochastic Gradient Descent Might Converge Even for Non-Convex Losses
The Noisy-SGD algorithm is widely used for privately training machine learning models. Traditional privacy analyses of this algorithm assume that the internal state is publicly revealed, resulting in privacy loss bounds that increase indefinitely with the number of iterations. However, recent findings have shown that if the internal state remains hidden, then the privacy loss might remain bounded. Nevertheless, this remarkable result heavily relies on the assumption of (strong) convexity of the loss function. It remains an important open problem to further relax this condition while proving similar convergent upper bounds on the privacy loss. In this work, we address this problem for DP-SGD, a popular variant of Noisy-SGD that incorporates gradient clipping to limit the impact of individual samples on the training process. Our findings demonstrate that the privacy loss of projected DP-SGD converges exponentially fast, without requiring convexity or smoothness assumptions on the loss function. In addition, we analyze the privacy loss of regularized (unprojected) DP-SGD. To obtain these results, we directly analyze the hockey-stick divergence between coupled stochastic processes by relying on non-linear data processing inequalities.
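For readers unfamiliar with the algorithm being analyzed, a single step of projected DP-SGD can be sketched as follows: clip each per-sample gradient, add Gaussian noise to their sum, take a gradient step, and project the iterate back onto a bounded set. This is a minimal sketch under stated assumptions, not the paper's formulation: the constraint set is assumed to be an L2 ball, the noise standard deviation is assumed to be `noise_multiplier * clip_norm`, and all function and parameter names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def clip(v: Sequence[float], max_norm: float) -> Vector:
    """Scale v down, if needed, so that its L2 norm is at most max_norm."""
    norm = l2_norm(v)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in v]


def project_to_ball(v: Sequence[float], radius: float) -> Vector:
    """Euclidean projection onto the L2 ball (assumed constraint set)."""
    return clip(v, radius)


def dp_sgd_step(
    params: Vector,
    batch: Sequence[Sequence[float]],
    grad_fn: Callable[[Vector, Sequence[float]], Vector],
    lr: float,
    clip_norm: float,
    noise_multiplier: float,
    radius: float,
    rng: random.Random,
) -> Vector:
    """One projected DP-SGD step: clip, sum, add noise, average, step, project."""
    total = [0.0] * len(params)
    for sample in batch:
        g = clip(grad_fn(params, sample), clip_norm)  # per-sample clipping
        total = [t + gi for t, gi in zip(total, g)]
    sigma = noise_multiplier * clip_norm  # assumed noise calibration
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    avg = [x / len(batch) for x in noisy]
    stepped = [p - lr * a for p, a in zip(params, avg)]
    return project_to_ball(stepped, radius)


if __name__ == "__main__":
    # Toy loss 0.5 * ||params - sample||^2, whose gradient is params - sample.
    grad = lambda p, s: [pi - si for pi, si in zip(p, s)]
    w = [0.0, 0.0]
    rng = random.Random(0)
    for _ in range(5):
        w = dp_sgd_step(w, [[1.0, 2.0], [3.0, 0.0]], grad, 0.1, 1.0, 0.5, 2.0, rng)
    print(w)
```

Only the final iterate would be released in the hidden-state setting the abstract refers to; the intermediate values of `w` printed or stored during the loop correspond to the internal state.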
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > Mexico > Mexico City > Coyoacan (0.04)
- North America > Canada > Ontario > Hamilton (0.04)
- Europe > Spain > Basque Country > Biscay Province > Bilbao (0.04)
Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification
Montalvo-Lezama, Ricardo, Montalvo-Lezama, Berenice, Fuentes-Pineda, Gibran
In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPS.
- North America > Canada (0.04)
- Oceania > Australia (0.04)
- North America > United States > California (0.04)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
Language statistics at different spatial, temporal, and grammatical scales
Sánchez-Puig, Fernanda, Lozano-Aranda, Rogelio, Pérez-Méndez, Dante, Colman, Ewan, Morales-Guzmán, Alfredo J., Pineda, Carlos, Torres, Pedro Juan Rivera, Gershenson, Carlos
Statistical linguistics has advanced considerably in recent decades as data has become available. This has allowed researchers to study how statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish considering the rank diversity at different scales: temporal (from 3 to 96 hour intervals), spatial (from 3km to 3000+km radii), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. These particular types of tokens show a sigmoid kind of behaviour as a rank diversity function. Our results help to quantify which aspects of language statistics seem universal and what may lead to variations.
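The abstract does not define rank diversity, so the sketch below assumes the usual definition: for each rank k, the number of distinct items that occupy rank k across the time slices, divided by the number of slices. The helper names (`ngrams`, `ranked_tokens`, `rank_diversity`) and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Join consecutive tokens into n-grams (n=1 gives monograms)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ranked_tokens(tokens: Sequence[str]) -> List[str]:
    """Items sorted by decreasing frequency; ties broken alphabetically."""
    counts = Counter(tokens)
    return [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def rank_diversity(slices: Sequence[Sequence[str]], max_rank: int) -> Dict[int, float]:
    """d(k) = (distinct items seen at rank k) / (number of time slices)."""
    rankings = [ranked_tokens(s) for s in slices]
    diversity = {}
    for k in range(1, max_rank + 1):
        # Slices with fewer than k distinct items contribute nothing at rank k.
        seen = {r[k - 1] for r in rankings if len(r) >= k}
        diversity[k] = len(seen) / len(rankings)
    return diversity


if __name__ == "__main__":
    # Three toy time slices of already-tokenized text.
    slices = [["a", "a", "b"], ["a", "a", "c"], ["a", "b", "b"]]
    print(rank_diversity(slices, max_rank=2))
```

In the toy data, two distinct items ("a" and "b") hold rank 1 across three slices, so d(1) = 2/3, while rank 2 is held by three different items, so d(2) = 1. To move up the grammatical scale, each slice would first be passed through `ngrams` with a larger `n`.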
- North America > Mexico > Mexico City > Mexico City (0.05)
- Europe > Spain > Galicia > Madrid (0.04)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
Valentine's Day sees huge increase in dating and romance scams looking to defraud people looking for love
Valentine's Day is a time to get close to the ones you love. And, just as importantly, not to get close to scammers. The loving feeling that abounds in February, and the sadness it provokes in many single people, are being exploited by fraudsters who use it to steal people's money and infect their computers. Hundreds of millions of fake emails are being sent out that appear as if they are coming from admirers. But if people follow them up they'll just be subject to scams and frauds, or be sent viruses.
- Asia > Thailand > Bangkok > Bangkok (0.24)
- Europe > Spain > Galicia > Madrid (0.05)
- North America > Mexico > Mexico City > Mexico City (0.05)
- Information Technology > Security & Privacy (1.00)
- Leisure & Entertainment > Social Events (0.87)